Lecture 23
Original data from Antonio, Almeida, and Nunes (2019), Data dictionary
Rows: 50,000
Columns: 23
$ hotel <fct> City_Hotel, City_Ho…
$ lead_time <dbl> 217, 2, 95, 143, 13…
$ stays_in_weekend_nights <dbl> 1, 0, 2, 2, 1, 2, 0…
$ stays_in_week_nights <dbl> 3, 1, 5, 6, 4, 2, 2…
$ adults <dbl> 2, 2, 2, 2, 2, 2, 2…
$ children <fct> none, none, none, n…
$ meal <fct> BB, BB, BB, HB, HB,…
$ country <fct> DEU, PRT, GBR, ROU,…
$ market_segment <fct> Offline_TA/TO, Dire…
$ distribution_channel <fct> TA/TO, Direct, TA/T…
$ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0…
$ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0…
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0…
$ reserved_room_type <fct> A, D, A, A, F, A, C…
$ assigned_room_type <fct> A, K, A, A, F, A, C…
$ booking_changes <dbl> 0, 0, 2, 0, 0, 0, 0…
$ deposit_type <fct> No_Deposit, No_Depo…
$ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0…
$ customer_type <fct> Transient-Party, Tr…
$ average_daily_rate <dbl> 80.75, 170.00, 8.00…
$ required_car_parking_spaces <fct> none, none, none, n…
$ total_of_special_requests <dbl> 1, 3, 2, 1, 4, 1, 1…
$ arrival_date <date> 2016-09-01, 2017-0…
Our goal is to develop a model that predicts whether a booking will include children based on the other characteristics of the booking.
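The class proportions below were presumably checked after a stratified train/test split. A sketch of how that might look, assuming the data frame is named `hotel`, a 75/25 split (consistent with the 37,500 / 12,500 row counts later in the notes), and an arbitrary seed:

```r
library(tidymodels)

# Stratify on the outcome so both splits have similar class balance.
# `hotel`, the seed, and the 0.75 proportion are assumptions.
set.seed(123)
hotel_split = initial_split(hotel, prop = 0.75, strata = children)
hotel_train = training(hotel_split)
hotel_test  = testing(hotel_split)

# Class proportions in the training (and, analogously, test) set
hotel_train %>%
  count(children) %>%
  mutate(prop = n / sum(n))
```

Stratifying on `children` matters here because the classes are heavily imbalanced (roughly 8% vs 92%).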
# A tibble: 2 × 3
children n prop
<fct> <int> <dbl>
1 children 3027 0.0807
2 none 34473 0.919
# A tibble: 2 × 3
children n prop
<fct> <int> <dbl>
1 children 1011 0.0809
2 none 11489 0.919
holidays = c("AllSouls", "AshWednesday", "ChristmasEve", "Easter",
"ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")
lr_recipe = recipe(children ~ ., data = hotel_train) %>%
step_date(arrival_date) %>%
step_holiday(arrival_date, holidays = holidays) %>%
step_rm(arrival_date) %>%
step_rm(country) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
lr_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 22
Operations:
Date features from arrival_date
Holiday features from arrival_date
Variables removed arrival_date
Variables removed country
Dummy variables from all_nominal_predictors()
Zero variance filter on all_predictors()
# A tibble: 37,500 × 76
lead_time stays_…¹ stays…² adults is_re…³ previ…⁴ previ…⁵ booki…⁶ days_…⁷ avera…⁸ total…⁹ child…˟
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 2 0 1 2 0 0 0 0 0 170 3 none
2 95 2 5 2 0 0 0 2 0 8 2 none
3 67 2 2 2 0 0 0 0 0 49.1 1 none
4 47 0 2 2 0 0 0 0 0 289 1 childr…
5 56 0 3 0 0 0 0 0 0 82.4 1 childr…
6 6 2 2 2 0 0 0 0 0 180 1 childr…
7 130 1 2 2 0 0 0 0 0 71 0 none
8 27 0 1 1 0 0 0 0 0 120. 1 none
9 46 0 2 2 0 0 0 0 0 162 0 none
10 423 1 1 2 0 0 0 0 0 122. 1 none
# … with 37,490 more rows, 64 more variables: arrival_date_year <int>, arrival_date_AllSouls <int>,
# arrival_date_AshWednesday <int>, arrival_date_ChristmasEve <int>, arrival_date_Easter <int>,
# arrival_date_ChristmasDay <int>, arrival_date_GoodFriday <int>, arrival_date_NewYearsDay <int>,
# arrival_date_PalmSunday <int>, hotel_Resort_Hotel <dbl>, meal_FB <dbl>, meal_HB <dbl>,
# meal_SC <dbl>, meal_Undefined <dbl>, market_segment_Complementary <dbl>,
# market_segment_Corporate <dbl>, market_segment_Direct <dbl>, market_segment_Groups <dbl>,
# market_segment_Offline_TA.TO <dbl>, market_segment_Online_TA <dbl>, …
══ Workflow ════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────
6 Recipe Steps
• step_date()
• step_holiday()
• step_rm()
• step_rm()
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Computational engine: glm
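The workflow printed above is likely assembled and fit along these lines; the object names `lr_model`, `lr_work`, and `lr_fit` are assumptions:

```r
# Plain logistic regression via glm (the default engine shown above)
lr_model = logistic_reg() %>%
  set_engine("glm")

# Combine the recipe and model into a single workflow
lr_work = workflow() %>%
  add_recipe(lr_recipe) %>%
  add_model(lr_model)

# Fitting the workflow runs the recipe steps, then fits the model
lr_fit = lr_work %>%
  fit(data = hotel_train)
```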
══ Workflow [trained] ══════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────
6 Recipe Steps
• step_date()
• step_holiday()
• step_rm()
• step_rm()
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept)
-2.543e+02
lead_time
-1.287e-03
stays_in_weekend_nights
5.231e-02
stays_in_week_nights
-3.433e-02
adults
7.328e-01
is_repeated_guest
3.962e-01
previous_cancellations
2.147e-01
previous_bookings_not_canceled
3.728e-01
booking_changes
-2.396e-01
days_in_waiting_list
6.415e-03
average_daily_rate
-1.049e-02
total_of_special_requests
-4.936e-01
arrival_date_year
1.344e-01
arrival_date_AllSouls
1.006e+00
arrival_date_AshWednesday
2.019e-01
arrival_date_ChristmasEve
5.328e-01
arrival_date_Easter
-9.749e-01
arrival_date_ChristmasDay
-6.875e-01
arrival_date_GoodFriday
-1.593e-01
arrival_date_NewYearsDay
-1.185e+00
arrival_date_PalmSunday
-6.243e-01
hotel_Resort_Hotel
9.581e-01
meal_FB
-6.348e-01
...
and 110 more lines.
# A tibble: 37,500 × 4
children .pred_class .pred_children .pred_none
<fct> <fct> <dbl> <dbl>
1 none none 0.0861 0.914
2 none none 0.0178 0.982
3 none none 0.0101 0.990
4 children children 0.931 0.0693
5 children none 0.473 0.527
6 children none 0.144 0.856
7 none none 0.0710 0.929
8 none none 0.0596 0.940
9 none none 0.0252 0.975
10 none none 0.0735 0.926
# … with 37,490 more rows
Truth
Prediction children none
children 1075 420
none 1952 34053
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.719
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.881
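The confusion matrix and metrics above can be computed from the fitted workflow with `yardstick`; `lr_train_perf` is an assumed name:

```r
# Attach class and probability predictions to the training data
lr_train_perf = lr_fit %>%
  augment(new_data = hotel_train) %>%
  select(children, starts_with(".pred"))

lr_train_perf %>%
  conf_mat(children, .pred_class)

lr_train_perf %>%
  precision(children, .pred_class)

# roc_auc() uses the predicted probability of the first factor level
lr_train_perf %>%
  roc_auc(children, .pred_children)
```

Note that these are training-set metrics, so they will tend to be optimistic relative to the test-set results shown next.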
lr_test_perf = lr_fit %>%
augment(new_data = hotel_test) %>%
select(children, starts_with(".pred"))
lr_test_perf %>%
  conf_mat(children, .pred_class)
          Truth
Prediction children none
children 359 137
none 652 11352
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.724
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.864
For this we will use the glmnet package, which supports fitting lasso, ridge, and elastic net models.
Collection of 1 parameters for tuning
identifier type object
penalty penalty nparam[+]
Lasso (and ridge) models are sensitive to the scale of the model features, so a standard approach is to normalize all features before fitting the model.
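A sketch of the lasso setup: the earlier recipe extended with a normalization step, plus a `glmnet` model with the penalty marked for tuning (`mixture = 1` selects a pure lasso). Object names are assumptions:

```r
# Reuse lr_recipe and standardize the numeric predictors
lasso_recipe = lr_recipe %>%
  step_normalize(all_numeric_predictors())

# mixture = 1 is lasso, mixture = 0 is ridge, in between is elastic net
lasso_model = logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lasso_work = workflow() %>%
  add_recipe(lasso_recipe) %>%
  add_model(lasso_model)
```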
# A tibble: 37,500 × 76
lead_time stays_…¹ stays…² adults is_re…³ previ…⁴ previ…⁵
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.858 -0.938 -0.767 0.337 -0.213 -0.0597 -0.112
2 0.160 1.09 1.32 0.337 -0.213 -0.0597 -0.112
3 -0.146 1.09 -0.245 0.337 -0.213 -0.0597 -0.112
4 -0.365 -0.938 -0.245 0.337 -0.213 -0.0597 -0.112
5 -0.267 -0.938 0.278 -3.59 -0.213 -0.0597 -0.112
6 -0.814 1.09 -0.245 0.337 -0.213 -0.0597 -0.112
7 0.544 0.0735 -0.245 0.337 -0.213 -0.0597 -0.112
8 -0.584 -0.938 -0.767 -1.63 -0.213 -0.0597 -0.112
9 -0.376 -0.938 -0.245 0.337 -0.213 -0.0597 -0.112
10 3.75 0.0735 -0.767 0.337 -0.213 -0.0597 -0.112
# … with 37,490 more rows, 69 more variables:
# booking_changes <dbl>, days_in_waiting_list <dbl>,
# average_daily_rate <dbl>,
# total_of_special_requests <dbl>, children <fct>,
# arrival_date_year <dbl>, arrival_date_AllSouls <dbl>,
# arrival_date_AshWednesday <dbl>,
# arrival_date_ChristmasEve <dbl>, …
══ Workflow ════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────
7 Recipe Steps
• step_date()
• step_holiday()
• step_rm()
• step_rm()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = tune()
mixture = 1
Computational engine: glmnet
# 5-fold cross-validation using stratification
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [30000/7500]> Fold1
2 <split [30000/7500]> Fold2
3 <split [30000/7500]> Fold3
4 <split [30000/7500]> Fold4
5 <split [30000/7500]> Fold5
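The stratified folds above are likely created with `vfold_cv()`; the seed is an assumption:

```r
# 5 folds, each holding out 7,500 of the 37,500 training rows,
# stratified on the outcome to preserve the class balance
set.seed(123)
hotel_vf = vfold_cv(hotel_train, v = 5, strata = children)
```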
( lasso_grid = lasso_work %>%
tune_grid(
hotel_vf,
grid = tibble(
penalty = 10^seq(-4, -1, length.out = 10)
),
control = control_grid(save_pred = TRUE),
metrics = metric_set(roc_auc)
)
)
# Tuning results
# 5-fold cross-validation using stratification
# A tibble: 5 × 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [30000/7500]> Fold1 <tibble> <tibble> <tibble>
2 <split [30000/7500]> Fold2 <tibble> <tibble> <tibble>
3 <split [30000/7500]> Fold3 <tibble> <tibble> <tibble>
4 <split [30000/7500]> Fold4 <tibble> <tibble> <tibble>
5 <split [30000/7500]> Fold5 <tibble> <tibble> <tibble>
# A tibble: 10 × 7
penalty .metric .estimator mean n std_err
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 0.0001 roc_auc binary 0.877 5 0.00318
2 0.000215 roc_auc binary 0.877 5 0.00316
3 0.000464 roc_auc binary 0.877 5 0.00314
4 0.001 roc_auc binary 0.877 5 0.00304
5 0.00215 roc_auc binary 0.877 5 0.00263
6 0.00464 roc_auc binary 0.870 5 0.00253
7 0.01 roc_auc binary 0.853 5 0.00249
8 0.0215 roc_auc binary 0.824 5 0.00424
9 0.0464 roc_auc binary 0.797 5 0.00400
10 0.1 roc_auc binary 0.5 5 0
# … with 1 more variable: .config <chr>
# A tibble: 10 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.001 roc_auc binary 0.877 5 0.00304 Preproce…
2 0.00215 roc_auc binary 0.877 5 0.00263 Preproce…
3 0.000464 roc_auc binary 0.877 5 0.00314 Preproce…
4 0.000215 roc_auc binary 0.877 5 0.00316 Preproce…
5 0.0001 roc_auc binary 0.877 5 0.00318 Preproce…
6 0.00464 roc_auc binary 0.870 5 0.00253 Preproce…
7 0.01 roc_auc binary 0.853 5 0.00249 Preproce…
8 0.0215 roc_auc binary 0.824 5 0.00424 Preproce…
9 0.0464 roc_auc binary 0.797 5 0.00400 Preproce…
10 0.1 roc_auc binary 0.5 5 0 Preproce…
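The two tables above can be produced from the tuning results: the first (in penalty order) with `collect_metrics()`, the second (sorted best-first) with `show_best()`:

```r
# All results, one row per penalty value
lasso_grid %>%
  collect_metrics()

# Same results ranked by mean roc_auc
lasso_grid %>%
  show_best(metric = "roc_auc", n = 10)
```

Note the sharp drop to an AUC of 0.5 at `penalty = 0.1`: the penalty is large enough that the lasso shrinks every coefficient to zero, leaving a model no better than chance.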
Since we used control_grid(save_pred = TRUE) with tune_grid(), we can recover the out-of-sample predictions for each fold:
# A tibble: 37,500 × 7
id .pred_child…¹ .pred…² .row penalty child…³ .config
<chr> <dbl> <dbl> <int> <dbl> <fct> <chr>
1 Fold1 0.366 0.634 5 0.00215 childr… Prepro…
2 Fold1 0.144 0.856 6 0.00215 childr… Prepro…
3 Fold1 0.0542 0.946 19 0.00215 none Prepro…
4 Fold1 0.0266 0.973 21 0.00215 none Prepro…
5 Fold1 0.106 0.894 22 0.00215 childr… Prepro…
6 Fold1 0.0286 0.971 23 0.00215 none Prepro…
7 Fold1 0.0205 0.980 30 0.00215 none Prepro…
8 Fold1 0.0192 0.981 31 0.00215 none Prepro…
9 Fold1 0.0431 0.957 32 0.00215 none Prepro…
10 Fold1 0.0532 0.947 35 0.00215 none Prepro…
# … with 37,490 more rows, and abbreviated variable names
# ¹.pred_children, ².pred_none, ³children
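The fold predictions shown above can be extracted for a single penalty value with `collect_predictions()`; the penalty shown (0.00215, the fifth grid value) is used here for illustration:

```r
# Out-of-sample predictions across all 5 folds for one penalty value
lasso_grid %>%
  collect_predictions(parameters = tibble(penalty = 0.00215))
```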
Typically with a tuned model we will refit using the complete training data and the "best" parameter value(s):
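A sketch of that refit: `select_best()` picks the winning penalty, `finalize_workflow()` substitutes it for `tune()`, and the result is fit on the full training set. The name `lasso_fit` matches the code below; the rest is an assumption:

```r
lasso_fit = lasso_work %>%
  finalize_workflow(select_best(lasso_grid, metric = "roc_auc")) %>%
  fit(data = hotel_train)
```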
lasso_test_perf = lasso_fit %>%
augment(new_data = hotel_test) %>%
select(children, starts_with(".pred"))
lasso_test_perf %>%
  conf_mat(children, .pred_class)
          Truth
Prediction children none
children 330 109
none 681 11380
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.752
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.866
# A tibble: 6 × 2
engine mode
<chr> <chr>
1 ranger classification
2 ranger regression
3 randomForest classification
4 randomForest regression
5 spark classification
6 spark regression
We skip dummy coding in the recipe, as it is not needed by ranger.
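A sketch of the random forest setup suggested by the tuning results below; the number of trees, the grid size, and the object names are assumptions:

```r
# Tune the number of predictors sampled per split (mtry) and the
# minimum node size (min_n); trees = 1000 is an assumed default
rf_model = rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Same preprocessing as before, but without step_dummy() --
# ranger handles factor predictors directly
rf_recipe = recipe(children ~ ., data = hotel_train) %>%
  step_date(arrival_date) %>%
  step_holiday(arrival_date, holidays = holidays) %>%
  step_rm(arrival_date) %>%
  step_rm(country)

rf_work = workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_model)

# Tune over a space-filling grid using the same folds as before;
# grid = 10 is an assumption
rf_grid = rf_work %>%
  tune_grid(
    hotel_vf,
    grid = 10,
    metrics = metric_set(roc_auc)
  )
```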
# A tibble: 5 × 8
mtry min_n .metric .estim…¹ mean n std_err
<int> <int> <chr> <chr> <dbl> <int> <dbl>
1 8 26 roc_auc binary 0.916 5 0.00172
2 4 29 roc_auc binary 0.916 5 0.00190
3 11 7 roc_auc binary 0.914 5 0.00182
4 15 21 roc_auc binary 0.913 5 0.00118
5 17 35 roc_auc binary 0.911 5 0.00191
# … with 1 more variable: .config <chr>, and
# abbreviated variable name ¹.estimator
# A tibble: 1 × 3
mtry min_n .config
<int> <int> <chr>
1 8 26 Preprocessor1_Model06
rf_test_perf = rf_fit %>%
augment(new_data = hotel_test) %>%
select(children, starts_with(".pred"))
rf_test_perf %>%
  conf_mat(children, .pred_class)
          Truth
Prediction children none
children 402 69
none 609 11420
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.854
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.920
Sta 523 - Fall 2022